Required current contributed CRAN packages:

I am running R 3.6.1, with recent update.packages().

needed <- c("raster", "rgdal", "RSQLite", "stars", "abind", "sp", "mapview",
"sf", "osmdata", "wordcloud", "RColorBrewer")

Script

Script and data at https://github.com/rsbivand/uw_191212/raw/master/uw_Rgeo_191212.zip. Download to suitable location, unzip and use as basis.

Script with local examples: https://github.com/rsbivand/uw_191212/raw/master/thurs191212.R

Preface/positionality

I am afraid that this talk will not contain many graphical elements. Why is this? Text is processed by different parts of our cognitive systems than images, which open for many more mis-understandings. Text is generally a more direct channel of communication.

Code is privileged text, it shows what is happening explicitly. It facilitates the reproduction of the workflow, and enables clear checking of robustness. Literate programming, and its modern form as notebooks that can be executed to reproduce outcomes, may seem hard to digest, but they enable the reader to become a co-creator of content.

It is a different kind of performance than a story, in that the reader may enter the workflow. In GUI-based stories, it is very possible that neither the author nor the reader can reconstruct the actual workflow yielding the presented outcomes.

Outline

Introduction and background: why break stuff (why not)?

Spatial data input/output and representation: movement from legacy sp to standards-compliant sf representation; coordinate reference systems; developments similar to GeoPandas and Shapely in Python

Opportunities in visualization (tmap, mapview), but also challenges when upstream software libraries evolve (PROJ/GDAL)

R-Spatial: getting to here and now

Background

In the early and mid 1990s, those of us who were teaching courses in spatial analysis beyond the direct application of geographical information systems (GIS) found the paucity of software limiting.

In institutions with funding for site licenses for GIS, it was possible to write or share scripts for Arc/Info (in AML), ArcView (in Avenue), or later in Visual Basic for ArcGIS.

If site licenses and associated dongles used in the field were a problem (including students involved in fieldwork in research projects), there were few alternatives, but opportunities were discussed on mailing lists.

From late 1996, the R programmimg language and environment began to be seen as an alternative for teaching and research involving spatial analysis.

R uses much of the syntax of S, then available commercially as S-Plus, but was and remains free to install, use and extend under the GNU General Public License (GPL).

In addition, it could be installed portably across multiple operating systems, including Windows and Apple MACOS.

At about the same time, the S-Plus SpatialStats module was published (Kaluzny et al. 1998), and a meeting occurred in Leicester to which many of those looking for solutions took part.

Much of the porting of S code to R for spatial statistics was begun by Albrecht Gebhardt as soon as the R package mechanism matured. Since teachers moving courses from S to R needed access to the S libraries previously used, porting was a crucial step.

CRAN listings show tripack (Renka and Gebhardt 2016) and akima (Akima and Gebhardt 2016) - both with non-open source licenses - available from August 1998 ported by Albrecht Gebhardt; ash and sgeostat (Majure and Gebhardt 2016) followed in April 1999.

The spatial package was available as part of MASS (Venables and Ripley 2002), also ported in part by Albrecht Gebhardt.

In the earliest period, CRAN administrators helped practically with porting and publication.

Albrecht and I presented an overview of possibilities of usin R for research and teaching in spatial analysis and statistics in August 1998 (Roger Bivand and Gebhardt 2000).

The S-PLUS version of splancs provided point pattern analysis (Rowlingson and Diggle 1993, 2017).

I had contacted Barry Rowlingson in 1997 but only moved forward with porting as R’s ability to load shared objects advanced.

In September 1998, I wrote to him: “It wasn’t at all difficult to get things running, which I think is a result of your coding, thank you!”

However, I added this speculation: “An issue I have thought about a little is whether at some stage Albrecht and I wouldn’t integrate or harmonize the points and pairs objects in splancs, spatial and sgeostat - they aren’t the same, but for users maybe they ought to appear to be so”.

This concern with class representations for geographical data turned out to be fruitful.

A further step was to link GRASS and R (Roger Bivand 2000), and followed up at several meetings and working closely with Markus Neteler.

The interface has evolved, and its almost current status is presented by (Lovelace, Nowosad, and Muenchow 2019), we return to the current status below.

A consequence of this work was that the CRAN team suggested that I attend a meeting in Vienna in early 2001 to talk about the GRASS GIS interface.

The meeting gave unique insights into the dynamics of R development, and very valuable contacts.

Later the same year Luc Anselin and Serge Rey asked me to take part in a workshop in Santa Barbara, which again led to many fruitful new contacts (Bivand 2006).

Further progress was made in spatial econometrics (Bivand 2002).

2003 Vienna workshop

During the second half of 2002, it seemed relevant to propose a spatial statistics paper session at the next Vienna meeting to be held in March 2003, together with a workshop to discuss classes for spatial data.

I had reached out to Edzer Pebesma as an author of the stand-alone open source program gstat (Pebesma and Wesseling 1998); it turned out that he had just been approached to wrap the program for S-Plus.

He saw the potential of the workshop immediately, and in November 2002 wrote in an email: “I wonder whether I should start writing S classes. I’m afraid I should.”

Virgilio Gómez-Rubio had been developing two spatial packages, RArcInfo (V. Gómez-Rubio and López-Quílez 2005; Gómez-Rubio 2011) and DCluster (V. Gómez-Rubio, Ferrándiz-Ferragud, and Lopez-Quílez 2005; Gómez-Rubio, Ferrándiz-Ferragud, and López-Quílez 2015), and was committed to participating.

Although he could not get to the workshop, Nicholas Lewin-Koh wrote in March 2003 that: “I was looking over all the DSC material, especially the spatial stuff. I did notice, after looking through peoples’ packages that there is a lot of duplication of effort. My suggestion is that we set up a repository for spatial packages similar to the Bioconductor mode, where we have a base spatial package that has S-4 based methods and classes that are efficient and general.”

Straight after the workshop, a collaborative repository for the development of software using SourceForge was established, and the R-sig-geo mailing list (still with over 3,500 subscribers) was created to facilitate interaction.

Beginnings of sp

So the mandate for the development of the sp package emerged in discussions between interested contributors before, during, and especially following the 2003 Vienna workshop.

Coding meetings were organized by Barry Rowlingson in Lancaster in November 2004 and by Virgilio Gómez-Rubio in Valencia in May 2005, at both of which the class definitions and implementations were stress-tested and often changed radically; the package was first published on CRAN in April 2005.

The underlying model adopted was for S4 (new-style) classes to be used, for "Spatial" objects, whether raster or vector, to behave like "data.frame" objects, and for visualization methods to make it easy to show the objects.

Relationships with other packages

From an early point in time, object conversion (known as coercion in S and R) to and from sp classes and classes in for example the spatstat package (Baddeley and Turner 2005; Baddeley, Rubak, and Turner 2015; Baddeley, Turner, and Rubak 2019).

Packages could choose whether they would use sp classes and methods directly, or rather use those classes for functionality that they did not provide themselves through coercion.

Reading and writing ESRI Shapefiles had been possible using the maptools package (Bivand and Lewin-Koh 2019) available from CRAN since August 2003, but rgdal, on CRAN from November 2003, initially only supported raster data read and written using the external GDAL library (Warmerdam 2008).

Further code contributions by Barry Rowlingson for handling projections using the external PROJ.4 library and the vector drivers in the then OGR part of GDAL were folded into rgdal, permitting reading into sp-objects and writing from sp-objects of vector and raster data.

Completing the sp-verse

For vector data it became possible to project coordinates, and in addition to transform them where datum specifications were available.

Until recently, the interfaces to external libraries GDAL and PROJ have been relatively stable, and upstream changes have not led to breaking changes for users of packages using sp classes or rgdal functionalities, although they have involved significant maintenance effort.

The final part of the framework for spatial vector data handling was the addition of the rgeos package interfacing the external GEOS library in 2011, thanks to Colin Rundell’s 2010 Google Summer of Coding project.

The rgeos package provided vector topological predicates and operations typically found in GIS such as intersection; note that by this time, both GDAL and GEOS used the Simple Features vector representation internally.

Applied Spatial Data Analysis with R (ASDAR, Springer: useR! series)

By the publication of ASDAR (Bivand, Pebesma, and Gomez-Rubio 2008), a few packages not written or maintained by the book authors and their nearest collaborators had begun to use sp classes. By the publication of the second edition (Bivand, Pebesma, and Gomez-Rubio 2013), we had seen that the number of packages depending on sp, importing from and suggesting it (in CRAN terminology for levels of dependency) had grown strongly. In late 2014, (de Vries 2014) looked at CRAN package clusters from a page rank graph, and found a clear spatial cluster that we had not expected. This cluster is from earlier this week:

Spatial data

Spatial data typically combine position data in 2D (or 3D), attribute data and metadata related to the position data. Much spatial data could be called map data or GIS data. We collect and handle much more position data since global navigation satellite systems (GNSS) like GPS came on stream 20 years ago, earth observation satellites have been providing data for longer.

When we began, most vector data had been digitised manually, with manual punching of attribute data, or was shared as downloaded files often as ESRI Shapefiles, the format introduced with ArcView. Nowadays, much data is available from online cloud sources, for example Open Street Map, here used to extract the light rail system in Bergen using the osmdata package. Setting up an Area of Interest (AoI), we query the data source for key values with two qualifications (those uploading the data were not consistent). These are then merged to give a single data object:

suppressPackageStartupMessages(library(osmdata))
library(sf)
## Linking to GEOS 3.8.0, GDAL 3.0.2, PROJ 6.2.1
bbox <- opq(bbox = 'bergen norway')
byb0 <- osmdata_sf(add_osm_feature(bbox, key = 'railway',
  value = 'light_rail'))$osm_lines
tram <- osmdata_sf(add_osm_feature(bbox, key = 'railway',
  value = 'tram'))$osm_lines
byb1 <- tram[!is.na(tram$name),]
o <- intersect(names(byb0), names(byb1))
byb <- rbind(byb0[,o], byb1[,o])
saveRDS(byb, file="byb.rds")

Spatial vector data is based on points, from which other geometries are constructed. Vector data is often also termed object-based spatial data. The light rail tracks are 2D vector data. The points themselves are stored as double precision floating point numbers, typically without recorded measures of accuracy (GNSS provides a measure of accuracy). Here, lines are constructed from points.

It has also become typical over the past few years to show such data objects in a pan/zoom interactive setting, here with the mapview package:

library(mapview)
mapview(byb)

Data handling

We can download monthly CSV files of city bike use, and manipulate the input to let us use the stplanr package to aggregate origin-destination data. One destination is in Oslo, some are round trips, but otherwise things are OK.

bike_fls <- list.files("bbs")
trips0 <- NULL
for (fl in bike_fls) trips0 <- rbind(trips0,
  read.csv(file.path("bbs", fl), header=TRUE))
trips0 <- trips0[trips0[, 8] < 6 & trips0[, 13] < 6,]
trips <- cbind(trips0[,c(1, 4, 2, 9)], data.frame(count=1))
from <- unique(trips0[,c(4,5,7,8)])
names(from) <- substring(names(from), 7)
to <- unique(trips0[,c(9,10,12,13)])
names(to) <- substring(names(to), 5)
# find origins and destinations and merge them
stations0 <- st_as_sf(merge(from, to, all=TRUE),
  coords=c("station_longitude", "station_latitude"))
stations <- aggregate(stations0, list(stations0$station_id),
  head, n=1)
suppressWarnings(stations <- st_cast(stations, "POINT"))
st_crs(stations) <- 4326
# sum trips between origins and destinations
od <- aggregate(trips[,-(1:4)], list(trips$start_station_id,
  trips$end_station_id), sum)

We can use the stplanr package to generate the station-to-station desire lines between different stations, omitting bikes dropped at the same station where they were picked up:

# drop equal origins and destinations
od <- od[-(which(od[,1] == od[,2])),]
# create desire lines
library(stplanr)
od_lines <- od2line(flow=od, zones=stations, zone_code="Group.1",
  origin_code="Group.1", dest_code="Group.2")
saveRDS(od_lines, "od_lines.rds")

Origin-destination lines

od_lines <- readRDS("od_lines.rds")
mapview(od_lines, alpha=0.2, lwd=(od_lines$x/max(od_lines$x))*10)

We can use CycleStreets to route the volumes onto OSM cycle paths, via an API and API key.

# allocate desirelines to cycle paths using CycleStreet
Sys.setenv(CYCLESTREET="XxXxXxXxXxXxXxXx")
od_routes <- line2route(od_lines, "route_cyclestreet",
  plan = "fastest")
saveRDS(od_routes, "od_routes.rds")

Routed lines along cycle routes

od_routes <- readRDS("od_routes.rds")
mapview(od_routes, alpha=0.2, lwd=(od_lines$x/max(od_lines$x))*10)

We’d still need to aggregate the bike traffic by cycle path segment for completeness,

Advancing from the sp representation

Representing spatial vector data in R (sp)

The sp package was a child of its time, using S4 formal classes, and the best compromise we then had of positional representation (not arc-node, but hard to handle holes in polygons). If we coerse byb to the sp representation, we see the formal class structure. Input/output used OGR/GDAL vector drivers in the rgdal package, and topological operations used GEOS in the rgeos package.

library(sp)
byb_sp <- as(byb, "Spatial")
str(byb_sp, max.level=2)
## Formal class 'SpatialLinesDataFrame' [package "sp"] with 4 slots
##   ..@ data       :'data.frame':  189 obs. of  10 variables:
##   ..@ lines      :List of 189
##   ..@ bbox       : num [1:2, 1:2] 5.23 60.29 5.36 60.39
##   .. ..- attr(*, "dimnames")=List of 2
##   ..@ proj4string:Formal class 'CRS' [package "sp"] with 1 slot
str(slot(byb_sp, "lines")[[1]])
## Formal class 'Lines' [package "sp"] with 2 slots
##   ..@ Lines:List of 1
##   .. ..$ :Formal class 'Line' [package "sp"] with 1 slot
##   .. .. .. ..@ coords: num [1:14, 1:2] 5.33 5.33 5.33 5.33 5.33 ...
##   .. .. .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. .. .. ..$ : chr [1:14] "4870663682" "331531217" "331531216" "331531215" ...
##   .. .. .. .. .. ..$ : chr [1:2] "lon" "lat"
##   ..@ ID   : chr "30098967"

Raster data

Spatial raster data is observed using rectangular (often square) cells, within which attribute data are observed. Raster data are very rarely object-based, very often they are field-based and could have been observed everywhere. We probably do not know where within the raster cell the observed value is correct; all we know is that at the chosen resolution, this is the value representing the whole cell area.

Elevation data may be accessed from the AWS cloud using the elevatr package:

library(elevatr)
elevation <- get_elev_raster(byb_sp, z = 10)
is.na(elevation) <- elevation < 1
saveRDS(elevation, file="elevation.rds")

The native import format is that of the raster package building on sp:

elevation <- readRDS("elevation.rds")
str(elevation, max.level=2)
## Formal class 'RasterLayer' [package "raster"] with 12 slots
##   ..@ file    :Formal class '.RasterFile' [package "raster"] with 13 slots
##   ..@ data    :Formal class '.SingleLayerData' [package "raster"] with 13 slots
##   ..@ legend  :Formal class '.RasterLegend' [package "raster"] with 5 slots
##   ..@ title   : chr(0) 
##   ..@ extent  :Formal class 'Extent' [package "raster"] with 4 slots
##   ..@ rotated : logi FALSE
##   ..@ rotation:Formal class '.Rotation' [package "raster"] with 2 slots
##   ..@ ncols   : int 1545
##   ..@ nrows   : int 1043
##   ..@ crs     :Formal class 'CRS' [package "sp"] with 1 slot
##   ..@ history : list()
##   ..@ z       : list()

The sp representation is:

str(as(elevation, "SpatialGridDataFrame"), max.level=2)
## Formal class 'SpatialGridDataFrame' [package "sp"] with 4 slots
##   ..@ data       :'data.frame':  1611435 obs. of  1 variable:
##   ..@ grid       :Formal class 'GridTopology' [package "sp"] with 3 slots
##   ..@ bbox       : num [1:2, 1:2] 4.92 60.06 5.98 60.42
##   .. ..- attr(*, "dimnames")=List of 2
##   ..@ proj4string:Formal class 'CRS' [package "sp"] with 1 slot

and the mapped values:

mapview(elevation, col=terrain.colors)

Raster data

The raster package complemented sp for handling raster objects and their interactions with vector objects.

It added to input/output using GDAL through rgdal, and better access to NetCDF files for GDAL built without the relevant drivers.

It may be mentioned in passing that thanks to help from CRAN administrators and especially Brian Ripley, CRAN binary builds of rgdal for Windows and Apple Mac OSX became available from 2006, but with a limited set of vector and raster drivers.

Support from CRAN adminstrators remains central to making packages available to users who are not able to install R source packages themselves, particularly linking to external libraries.

Initially, raster was written in R using functionalities in sp and rgdal with rgeos coming later.

It used a feature of GDAL raster drivers permitting the successive reading of subsets of rasters by row and column, permitting the processing of much larger objects than could be held in memory.

In addition, the concepts of bricks and stacks of rasters were introduced, diverging somewhat from the sp treatment of raster bands as stacked columns as vectors in a data frame.

Questions arose

As raster evolved, two other packages emerged raising issues with the ways in which spatial objects had been conceptualized in sp.

The rgeos package used the C application programming interface (API) to the C++ GEOS library, which is itself a translation of the Java Topology Suite (JTS).

While the GDAL vector drivers did use the standard Simple Features representation of ector geometries, it was not strongly enforced.

This laxity now seems most closely associated with the use of ESRI Shapefiles as a de-facto file standard for representation, in which many Simple Features are not consistently representable.

Need for vector standards compliance

Both JTS and GEOS required a Simple Feature compliant representation, and led to the need for curious and fragile adaptations.

For example, these affected the representation of sp "Polygons" objects, which were originally conceptualized after the Shapefile specification: ring direction determined whether a ring was exterior or interior (a hole), but no guidance was given to show which exterior ring holes might belong to.

As R provides a way to add a character string comment to any object, comments were added to each "Polygons" object encoding the necessary information.

In this way, GEOS functionality could be used, but the fragility of vector representation in sp was made very obvious.

Spatio-temporal data

Another package affecting thinking about representation was spacetime, as it diverged from raster by stacking vectors for regular spatio-temporal objects with space varying faster than time.

So a single earth observation band observed repeatedly would be stored in a single vector in a data frame, rather than in the arguably more robust form of a four-dimensional array, with the band taking one position on the final dimension.

The second edition of (Bivand, Pebesma, and Gomez-Rubio 2013) took up all of these issues in one way or another, but after completing a spatial statistics special issue of the Journal of Statistical Software (Pebesma, Bivand, and Ribeiro 2015), it was time to begin fresh implementations of classes for spatial data.

Simple Features in R

It was clear that vector representations needed urgent attention, so the sf package was begun, aiming to implement the most frequently used parts of the specification (ISO 2004b; Kralidis 2008; Herring 2011).

Development was supported by a grant from the then newly started R Consortium, which brings together R developers and industry members.

However, although data frame objects in S and R have always been able to take list columns as valid columns, such list columns were not seen as “tidy” (Wickham 2014).

A key breakthrough came at the useR! 2016 conference, following an earlier decision to re-base vector objects on data frames, rather than as in sp to embed a data frame inside a collection of spatial features of the same kind.

sf begins

(Pebesma 2018) shows the status of the sf towards the end of 2017, with a geometry list column containing R wrappers around objects adhering to Simple Features specification definitions. The feature geometries are stored in numeric vectors, matrices, or lists of matrices, and may also be subject to arithmetic operations. Features are held in the "XY" class if two-dimensional, or "XYZ", "XYM" or "XYZM" if such coordinates are available; all single features are "sfg" (Simple Feature geometry) objects:

pt1 <- st_point(c(1,3))
pt2 <- pt1 + 1
pt3 <- pt2 + 1
str(pt3)
##  'XY' num [1:2] 3 5

Geometries may be represented as “Well Known Text” (WKT):

st_as_text(pt3)
## [1] "POINT (3 5)"

or as “Well Known Binary” (WKB) as in databases’ “binary large objects” (BLOBs), resolving the problem of representation when working with GDAL vector drivers and functions, and with GEOS predicates and topological operations:

st_as_binary(pt3)
##  [1] 01 01 00 00 00 00 00 00 00 00 00 08 40 00 00 00 00 00 00 14 40

A column of simple feature geometries ("sfc") is constructed as a list of "sfg" objects, which do not have to belong to the same Simple Features category:

pt_sfc <- st_as_sfc(list(pt1, pt2, pt3))
str(pt_sfc)
## sfc_POINT of length 3; first list element:  'XY' num [1:2] 1 3

Finally, an "sfc" object, a geometry column, can be added to a data.frame object using st_geometry(), which sets a number of attributes on the object and defines it as also being an "sf" object (the "agg" attribute if populated shows how observations on non-geometry columns should be understood):

Data frames are lists of vector components:

V1 <- 1:3
V2 <- letters[1:3]
V3 <- sqrt(V1)
V4 <- sqrt(as.complex(-V1))
L <- list(v1=V1, v2=V2, v3=V3, v4=V4)

Let’s create a data frame from this list, ending up with strings not treated as categorical variables:

DF <- as.data.frame(L)
str(DF)
## 'data.frame':    3 obs. of  4 variables:
##  $ v1: int  1 2 3
##  $ v2: Factor w/ 3 levels "a","b","c": 1 2 3
##  $ v3: num  1 1.41 1.73
##  $ v4: cplx  0+1i 0+1.41i 0+1.73i
DF <- as.data.frame(L, stringsAsFactors=FALSE)
str(DF)
## 'data.frame':    3 obs. of  4 variables:
##  $ v1: int  1 2 3
##  $ v2: chr  "a" "b" "c"
##  $ v3: num  1 1.41 1.73
##  $ v4: cplx  0+1i 0+1.41i 0+1.73i

Tidy list columns

Data frame list columns have existed for a long time:

DF$E <- list(d=1, e="1", f=TRUE)
str(DF)
## 'data.frame':    3 obs. of  5 variables:
##  $ v1: int  1 2 3
##  $ v2: chr  "a" "b" "c"
##  $ v3: num  1 1.41 1.73
##  $ v4: cplx  0+1i 0+1.41i 0+1.73i
##  $ E :List of 3
##   ..$ d: num 1
##   ..$ e: chr "1"
##   ..$ f: logi TRUE

At useR! in 2016, list columns were declared “tidy”, using examples including the difficulty of encoding polygon interior rings in non-list columns. The decision to accommodate “tidy” workflows as well as base-R workflows had already been made, as at least some users only know how to use so-called `tidy'' workflows. We add the“sfc”object to the data frame assigning throughst_geometry()`:

st_geometry(DF) <- pt_sfc
(DF)
## Simple feature collection with 3 features and 5 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 1 ymin: 3 xmax: 3 ymax: 5
## epsg (SRID):    NA
## proj4string:    NA
##   v1 v2       v3          v4    E    geometry
## 1  1  a 1.000000 0+1.000000i    1 POINT (1 3)
## 2  2  b 1.414214 0+1.414214i    1 POINT (2 4)
## 3  3  c 1.732051 0+1.732051i TRUE POINT (3 5)

The sf package does not implement all of the Simple Features geometry categories, but geometries may be converted to the chosen subset, using for example the gdal_utils() function with util="ogr2ogr", options="-nlt CONVERT_TO_LINEAR" to convert curve geometries in an input file to linear geometries.

Many of the functions in the sf package begin with st_ as a reference to the same usage in PostGIS, where the letters were intended to symbolise space and time, but where time has not yet been implemented.

sf also integrates GEOS topological predicates and operations into the same framework, replacing the rgeos package for access to GEOS functionality. The precision and scale defaults differ between sf and rgeos slightly; both remain fragile with respect to invalid geometries, of which there are many in circulation.

(buf_DF <- st_buffer(DF, dist=0.3))
## Simple feature collection with 3 features and 5 fields
## geometry type:  POLYGON
## dimension:      XY
## bbox:           xmin: 0.7 ymin: 2.7 xmax: 3.3 ymax: 5.3
## epsg (SRID):    NA
## proj4string:    NA
##   v1 v2       v3          v4    E                       geometry
## 1  1  a 1.000000 0+1.000000i    1 POLYGON ((1.3 3, 1.299589 2...
## 2  2  b 1.414214 0+1.414214i    1 POLYGON ((2.3 4, 2.299589 3...
## 3  3  c 1.732051 0+1.732051i TRUE POLYGON ((3.3 5, 3.299589 4...
plot(st_geometry(buf_DF))
plot(st_geometry(DF), add=TRUE)

Raster representations: stars

Like sf, stars was supported by an R Consortium grant, for scalable, spatio-temporal tidy arrays for R.

Spatio-temporal arrays were seen as an alternative way of representing multivariate spatio-temporal data from the choices made in the spacetime package, where a two-dimensional data frame contained stacked positions within stacked time points or intervals.

The proposed arrays might collapse to a raster layer if only one variable was chosen for one time point or interval.

More important, the development of the package was extended to accommodate a backend for earth data processing in which the data are retrieved and rescaled as needed from servers, most often cloud-based servers.

This example only covers a multivariate raster taken from a Landsat 7 view of a small part of the Brazilian coast. In the first part, a GeoTIFF file is read into memory, using three array dimensions, two in planar space, the third across six bands:

library(stars)
## Loading required package: abind
fn <- system.file("tif/L7_ETMs.tif", package = "stars")
L7 <- read_stars(fn)
L7
## stars object with 3 dimensions and 1 attribute
## attribute(s):
##   L7_ETMs.tif    
##  Min.   :  1.00  
##  1st Qu.: 54.00  
##  Median : 69.00  
##  Mean   : 68.91  
##  3rd Qu.: 86.00  
##  Max.   :255.00  
## dimension(s):
##      from  to  offset delta                       refsys point values    
## x       1 349  288776  28.5 +proj=utm +zone=25 +south... FALSE   NULL [x]
## y       1 352 9120761 -28.5 +proj=utm +zone=25 +south... FALSE   NULL [y]
## band    1   6      NA    NA                           NA    NA   NULL

The bands can be operated on arithmetically, for example to generate a new object containing values of the normalized difference vegetation index through a function applied across the \(x\) and \(y\) spatial dimensions:

ndvi <- function(x) (x[4] - x[3])/(x[4] + x[3])
(s2.ndvi <- st_apply(L7, c("x", "y"), ndvi))
## stars object with 2 dimensions and 1 attribute
## attribute(s):
##      ndvi          
##  Min.   :-0.75342  
##  1st Qu.:-0.20301  
##  Median :-0.06870  
##  Mean   :-0.06432  
##  3rd Qu.: 0.18667  
##  Max.   : 0.58667  
## dimension(s):
##   from  to  offset delta                       refsys point values    
## x    1 349  288776  28.5 +proj=utm +zone=25 +south... FALSE   NULL [x]
## y    1 352 9120761 -28.5 +proj=utm +zone=25 +south... FALSE   NULL [y]

The same file can also be accessed using the proxy mechanism, shich creates a link to the external entity, here a file:

L7p <- read_stars(fn, proxy=TRUE)
L7p
## stars_proxy object with 1 attribute in file:
## $L7_ETMs.tif
## [1] "/home/rsb/lib/r_libs/stars/tif/L7_ETMs.tif"
## 
## dimension(s):
##      from  to  offset delta                       refsys point values    
## x       1 349  288776  28.5 +proj=utm +zone=25 +south... FALSE   NULL [x]
## y       1 352 9120761 -28.5 +proj=utm +zone=25 +south... FALSE   NULL [y]
## band    1   6      NA    NA                           NA    NA   NULL

The same function can also be applied across the same two spatial dimentions of the array, but no calculation is carried out until the data is needed and the output resolution known:

(L7p.ndvi = st_apply(L7p, c("x", "y"), ndvi))
## stars_proxy object with 1 attribute in file:
## $L7_ETMs.tif
## [1] "/home/rsb/lib/r_libs/stars/tif/L7_ETMs.tif"
## 
## dimension(s):
##      from  to  offset delta                       refsys point values    
## x       1 349  288776  28.5 +proj=utm +zone=25 +south... FALSE   NULL [x]
## y       1 352 9120761 -28.5 +proj=utm +zone=25 +south... FALSE   NULL [y]
## band    1   6      NA    NA                           NA    NA   NULL    
## call list:
## [[1]]
## st_apply(X = X, MARGIN = c("x", "y"), FUN = ndvi)

The array object can also be split, here on the band dimension, to yield a representation as six rasters in list form:

gdalcubes

Earth Observation Data Cubes from Satellite Image Collections - extension of the stars proxy mechansim and the raster out-of-memory approach: (https://github.com/appelmar/gdalcubes_R).

Processing collections of Earth observation images as on-demand multispectral, multitemporal data cubes. Users define cubes by spatiotemporal extent, resolution, and spatial reference system and let ‘gdalcubes’ automatically apply cropping, reprojection, and resampling using the ‘Geospatial Data Abstraction Library’ (‘GDAL’).

Implemented functions on data cubes include reduction over space and time, applying arithmetic expressions on pixel band values, moving window aggregates over time, filtering by space, time, bands, and predicates on pixel values, materializing data cubes as ‘netCDF’ files, and plotting. User-defined ‘R’ functions can be applied over chunks of data cubes. The package implements lazy evaluation and multithreading. See also a five month old blog post.

Input/output

The sf, gdalcubes, lwgeom, rgdal and rgeos packages crucially depend on external software, using GDAL for input and output:

sf_extSoftVersion()
##           GEOS           GDAL         proj.4 GDAL_with_GEOS     USE_PROJ_H 
##        "3.8.0"        "3.0.2"        "6.2.1"         "true"         "true"

While sp handed off dependencies to interfaces to external software GEOS (rgeos) and GDAL+PROJ (rgdal), sf includes all the external dependencies itself. This also means that stars needs sf to provide raster drivers (some other packages like gdalcubes themselves link to GDAL).

sort(as.character(st_drivers("vector")$name))
##  [1] "AeronavFAA"     "AmigoCloud"     "ARCGEN"         "AVCBin"        
##  [5] "AVCE00"         "BNA"            "CAD"            "Carto"         
##  [9] "Cloudant"       "CouchDB"        "CSV"            "CSW"           
## [13] "DGN"            "DXF"            "EDIGEO"         "EEDA"          
## [17] "ElasticSearch"  "ESRI Shapefile" "ESRIJSON"       "Geoconcept"    
## [21] "GeoJSON"        "GeoJSONSeq"     "Geomedia"       "GeoRSS"        
## [25] "GFT"            "GML"            "GMLAS"          "GPKG"          
## [29] "GPSBabel"       "GPSTrackMaker"  "GPX"            "HTF"           
## [33] "HTTP"           "Idrisi"         "Interlis 1"     "Interlis 2"    
## [37] "JML"            "JP2OpenJPEG"    "KML"            "MapInfo File"  
## [41] "MBTiles"        "Memory"         "MSSQLSpatial"   "MVT"           
## [45] "NAS"            "netCDF"         "NGW"            "ODBC"          
## [49] "ODS"            "OGR_GMT"        "OGR_PDS"        "OGR_SDTS"      
## [53] "OGR_VRT"        "OpenAir"        "OpenFileGDB"    "OSM"           
## [57] "PCIDSK"         "PDF"            "PDS4"           "PGDUMP"        
## [61] "PGeo"           "PLSCENES"       "PostgreSQL"     "REC"           
## [65] "S57"            "SEGUKOOA"       "SEGY"           "Selafin"       
## [69] "SQLite"         "SUA"            "SVG"            "SXF"           
## [73] "TIGER"          "TopoJSON"       "UK .NTF"        "VDV"           
## [77] "VFK"            "Walk"           "WAsP"           "WFS"           
## [81] "WFS3"           "XLS"            "XLSX"           "XPlane"

The drivers provided by GDAL can (mostly) read from data formatted as described for the drivers, and can to a lesser extent write data out. Raster access can use spatial subsets of the data extent, something that is harder to do with vector. Proxy handling is similarly largely restricted to raster drivers.

sort(as.character(st_drivers("raster")$name))
##   [1] "AAIGrid"             "ACE2"                "ADRG"               
##   [4] "AIG"                 "AirSAR"              "ARG"                
##   [7] "BAG"                 "BIGGIF"              "BLX"                
##  [10] "BMP"                 "BSB"                 "BT"                 
##  [13] "BYN"                 "CAD"                 "CALS"               
##  [16] "CEOS"                "COASP"               "COSAR"              
##  [19] "CPG"                 "CTable2"             "CTG"                
##  [22] "DAAS"                "DERIVED"             "DIMAP"              
##  [25] "DIPEx"               "DOQ1"                "DOQ2"               
##  [28] "DTED"                "E00GRID"             "ECRGTOC"            
##  [31] "EEDAI"               "EHdr"                "EIR"                
##  [34] "ELAS"                "ENVI"                "ERS"                
##  [37] "ESAT"                "FAST"                "FIT"                
##  [40] "FujiBAS"             "GenBin"              "GFF"                
##  [43] "GIF"                 "GMT"                 "GPKG"               
##  [46] "GRASSASCIIGrid"      "GRIB"                "GS7BG"              
##  [49] "GSAG"                "GSBG"                "GSC"                
##  [52] "GTiff"               "GTX"                 "GXF"                
##  [55] "HDF5"                "HDF5Image"           "HF2"                
##  [58] "HFA"                 "HTTP"                "IDA"                
##  [61] "IGNFHeightASCIIGrid" "ILWIS"               "INGR"               
##  [64] "IRIS"                "ISCE"                "ISIS2"              
##  [67] "ISIS3"               "JAXAPALSAR"          "JDEM"               
##  [70] "JP2OpenJPEG"         "JPEG"                "KMLSUPEROVERLAY"    
##  [73] "KRO"                 "L1B"                 "LAN"                
##  [76] "LCP"                 "Leveller"            "LOSLAS"             
##  [79] "MAP"                 "MBTiles"             "MEM"                
##  [82] "MFF"                 "MFF2"                "MRF"                
##  [85] "MSGN"                "NDF"                 "netCDF"             
##  [88] "NGSGEOID"            "NGW"                 "NITF"               
##  [91] "NTv1"                "NTv2"                "NWT_GRC"            
##  [94] "NWT_GRD"             "OZI"                 "PAux"               
##  [97] "PCIDSK"              "PCRaster"            "PDF"                
## [100] "PDS"                 "PDS4"                "PLMOSAIC"           
## [103] "PLSCENES"            "PNG"                 "PNM"                
## [106] "PostGISRaster"       "PRF"                 "R"                  
## [109] "Rasterlite"          "RDA"                 "RIK"                
## [112] "RMF"                 "ROI_PAC"             "RPFTOC"             
## [115] "RRASTER"             "RS2"                 "RST"                
## [118] "SAFE"                "SAGA"                "SAR_CEOS"           
## [121] "SDTS"                "SENTINEL2"           "SGI"                
## [124] "SIGDEM"              "SNODAS"              "SRP"                
## [127] "SRTMHGT"             "Terragen"            "TIL"                
## [130] "TSX"                 "USGSDEM"             "VICAR"              
## [133] "VRT"                 "WCS"                 "WEBP"               
## [136] "WMS"                 "WMTS"                "XPM"                
## [139] "XYZ"                 "ZMap"

There are clear preferences among data providers and users for particular data formats, so some drivers get more exposure than others. For vector data, many still use "ESRI Shapefile", although its geometries are not SF-compliant, and data on features are stored in variant DBF files (text tiles, numerically imprecise, field name length restrictions, encoding issues). "geojson" and "GML" are text files with numeric imprecision in coordinates as well as data fields. Among vector drivers, "GPKG" is a viable standard and should be used as far as possible. This format is a file SQLite database, with multiple linked tables:

library(RSQLite)
db = dbConnect(SQLite(), dbname="snow/b_pump.gpkg")
dbListTables(db)
##  [1] "b_pump"                   "gpkg_contents"           
##  [3] "gpkg_extensions"          "gpkg_geometry_columns"   
##  [5] "gpkg_ogr_contents"        "gpkg_spatial_ref_sys"    
##  [7] "gpkg_tile_matrix"         "gpkg_tile_matrix_set"    
##  [9] "rtree_b_pump_geom"        "rtree_b_pump_geom_node"  
## [11] "rtree_b_pump_geom_parent" "rtree_b_pump_geom_rowid" 
## [13] "sqlite_sequence"

The geometry column table has a clear structure:

str(dbReadTable(db, "gpkg_geometry_columns"))
## 'data.frame':    1 obs. of  6 variables:
##  $ table_name        : chr "b_pump"
##  $ column_name       : chr "geom"
##  $ geometry_type_name: chr "POINT"
##  $ srs_id            : int 100000
##  $ z                 : int 0
##  $ m                 : int 0

Reading the "geom" component yields a 2D point; in this case there is one feature only, with data in a BLOB:

str(dbReadTable(db, "b_pump")$geom)
## blob [1:1] 
## $ : raw [1:29] 47 50 00 01 ...
## @ ptype: raw(0)
dbDisconnect(db)

We can query the layers in a "GPKG" data source: here the locations of the Broad Street pump and the other pumps in the Snow Soho Cholera epidemic data set:

st_layers("snow/b_pump.gpkg")
## Driver: GPKG 
## Available layers:
##   layer_name geometry_type features fields
## 1     b_pump         Point        1      1
st_layers("snow/nb_pump.gpkg")
## Driver: GPKG 
## Available layers:
##   layer_name geometry_type features fields
## 1    nb_pump         Point       11      1

On the sp side, we could retrieve information about a data source, vector:

library(rgdal)
## rgdal: version: 1.5-2, (SVN revision 896)
##  Geospatial Data Abstraction Library extensions to R successfully loaded
##  Loaded GDAL runtime: GDAL 3.0.2, released 2019/10/28
##  Path to GDAL shared files: /usr/local/share/gdal
##  GDAL binary built with GEOS: TRUE 
##  Loaded PROJ.4 runtime: Rel. 6.2.1, November 1st, 2019, [PJ_VERSION: 621]
##  Path to PROJ.4 shared files: /usr/local/share/proj
##  Linking to sp version: 1.3-3
ogrInfo("snow/nb_pump.gpkg")
## Source: "/home/rsb/presentations/uw19/snow/nb_pump.gpkg", layer: "nb_pump"
## Driver: GPKG; number of rows: 11 
## Feature type: wkbPoint with 2 dimensions
## Extent: (529180.4 180665.9) - (529747.4 181359)
## CRS: +proj=tmerc +lat_0=49 +lon_0=-2 +k=0.999601 +x_0=400000 +y_0=-100000 +ellps=airy +units=m +no_defs 
## Number of fields: 1 
##   name type length  typeName
## 1  cat   12      0 Integer64

or raster:

rgdal::GDALinfo(fn)
## rows        352 
## columns     349 
## bands       6 
## lower left origin.x        288776.3 
## lower left origin.y        9110729 
## res.x       28.5 
## res.y       28.5 
## ysign       -1 
## oblique.x   0 
## oblique.y   0 
## driver      GTiff 
## projection  +proj=utm +zone=25 +south +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m
## +no_defs 
## file        /home/rsb/lib/r_libs/stars/tif/L7_ETMs.tif 
## apparent band summary:
##   GDType hasNoDataValue NoDataValue blockSize1 blockSize2
## 1   Byte          FALSE           0          3        349
## 2   Byte          FALSE           0          3        349
## 3   Byte          FALSE           0          3        349
## 4   Byte          FALSE           0          3        349
## 5   Byte          FALSE           0          3        349
## 6   Byte          FALSE           0          3        349
## apparent band statistics:
##   Bmin Bmax Bmean Bsd
## 1    0  255    NA  NA
## 2    0  255    NA  NA
## 3    0  255    NA  NA
## 4    0  255    NA  NA
## 5    0  255    NA  NA
## 6    0  255    NA  NA
## Metadata:
## AREA_OR_POINT=Area

All of these facilities are taken from GDAL; the raster facilities have been extant for many years. raster used the ease of subsetting to permit large rasters to be handled out-of-memory.

Summary: sf::st_read() and rgdal::readOGR() are equivalent, as are sf::st_write() and rgdal::writeOGR(). When writing, you may need to take steps if overwriting. rgdal::readGDAL() reads the raster data (sub)set into an sp object, stars::read_stars() reads into a possibly proxy stars object, and raster can also be used:

library(raster)
(r <- raster(fn))
## class      : RasterLayer 
## band       : 1  (of  6  bands)
## dimensions : 352, 349, 122848  (nrow, ncol, ncell)
## resolution : 28.5, 28.5  (x, y)
## extent     : 288776.3, 298722.8, 9110729, 9120761  (xmin, xmax, ymin, ymax)
## crs        : +proj=utm +zone=25 +south +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs 
## source     : /home/rsb/lib/r_libs/stars/tif/L7_ETMs.tif 
## names      : L7_ETMs 
## values     : 0, 255  (min, max)

Output: rgdal::writeGDAL(), stars::write_stars() or raster::writeRaster() may be used for writing, but what happens depends on details, such as storage formats. Unlike vector, most often storage formats will be taken as homogeneous by type.

Tiled representations

While interactive web mapping interfaces use raster or vector tiled backgrounds, we have not (yet) approached tiles or pyramids internally.

Coordinate reference systems

Background

The usefulness of spatial data is linked to knowing its coordinate reference system. The coordinate reference system may be geographic, usually measured in decimal degrees, or projected, layered on a known geographic CRS, usually measured in metres (planar). The underlying geographical CRS must specify an ellipsoid, with associated major and minor axis lengths:

library(sp)
library(rgdal)
projInfo("ellps")[,c(1, 4)]
##         name                     description
## 1      MERIT                      MERIT 1983
## 2      SGS85       Soviet Geodetic System 85
## 3      GRS80            GRS 1980(IUGG, 1980)
## 4      IAU76                        IAU 1976
## 5       airy                       Airy 1830
## 6     APL4.9             Appl. Physics. 1965
## 7      NWL9D        Naval Weapons Lab., 1965
## 8   mod_airy                   Modified Airy
## 9     andrae      Andrae 1876 (Den., Iclnd.)
## 10    danish  Andrae 1876 (Denmark, Iceland)
## 11   aust_SA Australian Natl & S. Amer. 1969
## 12     GRS67               GRS 67(IUGG 1967)
## 13   GSK2011                        GSK-2011
## 14    bessel                     Bessel 1841
## 15  bess_nam           Bessel 1841 (Namibia)
## 16    clrk66                     Clarke 1866
## 17    clrk80                Clarke 1880 mod.
## 18 clrk80ign              Clarke 1880 (IGN).
## 19       CPM Comm. des Poids et Mesures 1799
## 20    delmbr         Delambre 1810 (Belgium)
## 21   engelis                    Engelis 1985
## 22   evrst30                    Everest 1830
## 23   evrst48                    Everest 1948
## 24   evrst56                    Everest 1956
## 25   evrst69                    Everest 1969
## 26   evrstSS       Everest (Sabah & Sarawak)
## 27   fschr60    Fischer (Mercury Datum) 1960
## 28  fschr60m           Modified Fischer 1960
## 29   fschr68                    Fischer 1968
## 30   helmert                    Helmert 1906
## 31     hough                           Hough
## 32      intl    International 1909 (Hayford)
## 33     krass                Krassovsky, 1942
## 34     kaula                      Kaula 1961
## 35     lerch                      Lerch 1979
## 36     mprts                 Maupertius 1738
## 37  new_intl          New International 1967
## 38   plessis           Plessis 1817 (France)
## 39      PZ90                           PZ-90
## 40    SEasia                  Southeast Asia
## 41   walbeck                         Walbeck
## 42     WGS60                          WGS 60
## 43     WGS66                          WGS 66
## 44     WGS72                          WGS 72
## 45     WGS84                          WGS 84
## 46    sphere       Normal Sphere (r=6370997)

Other parameters should be specified, such as the prime meridian, often taken as Greenwich. Before PROJ version 6, legacy PROJ (and GDAL) used a +datum= tag introduced after the library migrated beyond USGS (around version 4.4). The underlying problem was not that projection and inverse projection could not be carried out between projected CRS and geograpghical CRS, but that national mapping agencies defined often many datums, keying the specification of a geographical CRS to a national or regional datum. Some of these, especially for North America, were supported, but support for others was patchy. The +datum= tag supported a partly informal listing of values, themselves linked to three or seven coefficient datum transformation sets, used through the +towgs84= tag. Coefficient lookup through the +datum= tag, or direct specification of coefficients through the +towgs84= tag became a convenient way to handle datum transformation in addition to projection and inverse projection.

The default “hub” for transformation was to go through the then newly achieved WGS84 datum. Spatial data files often encoded the geographic and projected CRS with reference to these values, in some cases using PROJ 4 strings. These used a pseudo projection +proj=longlat to indicate a geographical CRS, and many other possible values of +proj= for projected CRS.

The Grids & Datums column in Photogrammetric Engineering & Remote Sensing gives insight into some of the peculiarities of national mapping agencies - authority is typically national but may be subnational:

data("GridsDatums")
GridsDatums[grep("Norway", GridsDatums$country),]
##                   country     month year ISO
## 17  The Kingdom of Norway (October) 1999 NOR
## 226                Norway    (July) 2017 NOR

Beyond this, the database successively developed by the European Petroleum Survey Group was copied to local CSV files for PROJ and GDAL, providing lookup by code number.

EPSG <- make_EPSG()
EPSG[grep("Oslo", EPSG$note), 1:2]
##       code                            note
## 515   4817                 NGO 1948 (Oslo)
## 2408 27391    NGO 1948 (Oslo) / NGO zone I
## 2409 27392   NGO 1948 (Oslo) / NGO zone II
## 2410 27393  NGO 1948 (Oslo) / NGO zone III
## 2411 27394   NGO 1948 (Oslo) / NGO zone IV
## 2412 27395    NGO 1948 (Oslo) / NGO zone V
## 2413 27396   NGO 1948 (Oslo) / NGO zone VI
## 2414 27397  NGO 1948 (Oslo) / NGO zone VII
## 2415 27398 NGO 1948 (Oslo) / NGO zone VIII

We can use CRS() to instantiate the coordinate reference system:

(o <-CRS("+init=epsg:4817"))
## Warning in showSRID(uprojargs, format = "PROJ", multiline = "NO"): Discarded
## datum NGO_1948_Oslo in CRS definition
## CRS arguments:
##  +proj=longlat +a=6377492.018 +rf=299.1528128 +pm=oslo +no_defs

The lookup prior to PROJ 6 used to provide a +towgs84= value of 278.3,93,474.5,7.889,0.05,-6.61,6.21, but in the new regime only reveals transformation coefficients in the context of a coordinate operation (only in the unreleased development version of rgdal), and not from the PROJ string returned, which has only ballpark accuracy:

list_coordOps("+proj=longlat +a=6377492.018 +rf=299.1528128 +pm=oslo +no_defs +type=crs", "EPSG:4326")
## Candidate coordinate operations found:  1 
## Strict containment:  FALSE 
## Visualization order:  TRUE 
## Source: +proj=longlat +a=6377492.018 +rf=299.1528128 +pm=oslo +no_defs
##         +type=crs
## Target: EPSG:4326 
## Best instantiable operation has only ballpark accuracy 
## Description: Transformation from unknown to unknown altered to use prime
##              meridian of WGS 84 + Ballpark geographic offset
##              from unknown altered to use prime meridian of WGS
##              84 to WGS 84 + axis order change (2D)
## Definition:  +proj=pipeline +step +proj=unitconvert +xy_in=deg +xy_out=rad
##              +step +inv +proj=longlat +a=6377492.018
##              +rf=299.1528128 +pm=oslo +step +proj=unitconvert
##              +xy_in=rad +xy_out=deg

Using the database lookup but not exporting to a PROJ string gives reasonable accuracy:

list_coordOps("EPSG:4817", "EPSG:4326")
## Candidate coordinate operations found:  1 
## Strict containment:  FALSE 
## Visualization order:  TRUE 
## Source: EPSG:4817 
## Target: EPSG:4326 
## Best instantiable operation has accuracy: 3 m
## Description: axis order change (2D) + NGO 1948 (Oslo) to NGO1948 (1) + NGO
##              1948 to WGS 84 (1) + axis order change (2D)
## Definition:  +proj=pipeline +step +proj=unitconvert +xy_in=deg +xy_out=rad
##              +step +inv +proj=longlat +a=6377492.018
##              +rf=299.1528128 +pm=oslo +step +inv +proj=longlat
##              +a=6377492.018 +rf=299.1528128 +step +proj=push
##              +v_3 +step +proj=cart +a=6377492.018
##              +rf=299.1528128 +step +proj=helmert +x=278.3 +y=93
##              +z=474.5 +rx=7.889 +ry=0.05 +rz=-6.61 +s=6.21
##              +convention=position_vector +step +inv +proj=cart
##              +ellps=WGS84 +step +proj=pop +v_3 +step
##              +proj=unitconvert +xy_in=rad +xy_out=deg

Up to and including PROJ 5, downstream software, like sf and rgdal, have been able to rely on the provision of ad-hoc transformtion capabilities, with apparently predictable consequences. Everybody knew (or should have known) that each new release of the PROJ and GDAL CSV metadata files could update transformation coefficients enough to shift outcomes a little. Everyone further chose to ignore the timestamping of coordinates, or at least of datasets; we could guess (as above) that US Census tract boundaries for 1980 must use the NAD27 datum framework - suprisingly many used NAD83 anyway (both for Boston and the North Carolina SIDS data set).

Use of KML files to provide zoom and pan for these boundaries, and now leaflet and mapview exposes approximations mercilessly. Use of coefficients of transformation of an unknown degree of approximation, and authority “googled it” was reaching its limits, or likely had exceeded them.

sp classes used a PROJ string to define the CRS (in an S4 "CRS" object):

getClass("CRS")
## Class "CRS" [package "sp"]
## 
## Slots:
##                 
## Name:   projargs
## Class: character

while sf uses an S3 "crs" object with an integer EPSG code and a PROJ string; if instantiated from the EPSG code, both are provided, here for now retaining the fragile +towgs84= key because the central OGRSpatialReference function exportToProj4() is not (yet) being called (it is called when reading from file):

library(sf)
st_crs(4326)
## Coordinate Reference System:
##   EPSG: 4326 
##   proj4string: "+proj=longlat +datum=WGS84 +no_defs"

Modernising PROJ and issues

PROJ

Because so much open source (and other) software uses the PROJ library and framework, many are affected when PROJ upgrades. Until very recently, PROJ has been seen as very reliable, and the changes taking place now are intended to confirm and reinforce this reliability. Before PROJ 5 (PROJ 6 is out now, PROJ 7 is coming early in 2020), the +datum= tag was used, perhaps with +towgs84= with three or seven coefficients, and possibly +nadgrids= where datum transformation grids were available. However, transformations from one projection to another first inversed to longitude-latitude in WGS84, then projected on to the target projection.

Fast-forward 35 years and PROJ.4 is everywhere: It provides coordinate handling for almost every geospatial program, open or closed source. Today, we see a drastical increase in the need for high accuracy GNSS coordinate handling, especially in the agricultural and construction engineering sectors. This need for geodetic-accuracy transformations is not satisfied by “classic PROJ.4”. But with the ubiquity of PROJ.4, we can provide these transformations “everywhere”, just by implementing them as part of PROJ.4 (Evers and Knudsen 2017).

Escaping the WGS84 hub/pivot: PROJ and OGC WKT2

Following the introduction of geodetic modules and pipelines in PROJ 5 (Knudsen and Evers 2017; Evers and Knudsen 2017), PROJ 6 moves further. Changes in the legacy PROJ representation and WGS84 transformation hub have been coordinated through the GDAL barn raising initiative. Crucially WGS84 often ceases to be the pivot for moving between datums. A new OGC WKT is coming, and an SQLite EPSG file database has replaced CSV files. SRS will begin to support 3D by default, adding time too as SRS change. See also PROJ migration notes.

There are very useful postings on the PROJ mailing list from Martin Desruisseaux, first proposing clarifications and a follow-up including a summary:

  • “Early binding” ≈ hub transformation technique.
  • “Late binding” ≈ hub transformation technique NOT used, replaced by a more complex technique consisting in searching parameters in the EPSG database after the transformation context (source, target, epoch, area of interest) is known.
  • The problem of hub transformation technique is independent of WGS84. It is caused by the fact that transformations to/from the hub are approximate. Any other hub we could invent in replacement of WGS84 will have the same problem, unless we can invent a hub for which transformations are exact (I think that if such hub existed, we would have already heard about it).

The solution proposed by ISO 19111 (in my understanding) is:

  • Forget about hub (WGS84 or other), unless the simplicity of early-binding is considered more important than accuracy.
  • Associating a CRS to a coordinate set (geometry or raster) is no longer sufficient. A {CRS, epoch} tuple must be associated. ISO 19111 calls this tuple “Coordinate metadata”. From a programmatic API point of view, this means that getCoordinateReferenceSystem() method in Geometry objects (for instance) needs to be replaced by a getCoordinateMetadata() method.

Upstream software dependencies of the R-spatial ecosystem

When changes occur in upstream external software, R packages using these libraries often need to adapt, but package maintainers try very hard to shield users from any consequences, so that legacy workflows continue to provide the same or at least similar results from the same data.

The code shown in (Bivand, Pebesma, and Gomez-Rubio 2008, 2013) is almost all run nightly on a platform with updated R packages and external software.

This does not necessarily trap all differences (figures are not compared), but is helpful in detecting impacts of changes in packages or external software.

It is also very helpful that CRAN servers using the released and development versions of R, and with different levels of external software also run nightly checks.

Again, sometimes changes are only noticed by users, but quite often checks run by maintainers and by CRAN alert us to impending challenges.

Tracking the development mailing lists of the external software communities, all open source, can also show how thinking is evolving, although sometimes code tidying in external software can have unexpected consequences, breaking not sf or sp with rgdal or rgeos, but a package further downstream.

(Bivand 2014) discusses open source geospatial software stacks more generally, but here we will consider ongoing changes in PROJ.

(Knudsen and Evers 2017; Evers and Knudsen 2017) not only point out how the world has changed since a World Geodetic System of 1984 (WGS84) was adopted as a hub for coordinate transformation in PROJ, but also introduced transformation pipelines.

In using a transformation hub, PROJ had worked adequately when the errors introduced by transforming first to WGS84 and then from WGS84 to the target coordinate reference system, but with years passing from 1984, the world has undergone sufficient tectonic shifts for errors to increase.

In addition, the need for precision has risen in agriculture and engineering. So PROJ, as it was, risked ceasing to be fit for purpose as a fundamental component of the geospatial open source software stack.

Following major changes in successive iterations of the international standards for coordinate reference systems (ISO 2004a), PROJ is changing from preferring “late-binding” transformations, pivoting through a known transformation hub in going from input to target coordinate reference systems, to “early-binding” transformations.

This means that the user may be offered alternative paths from input to target coordinate reference systems, some of which may go directly, and more will use higher precision transformation grids, enlarging the existing practice of using North American Datum (NAD) grids.

In other cases, three or seven coefficient transformations may be offered, but the default fallback, where little is known about the input or target specification, may be less satisfactory than PROJ has previously offered.

PROJ will also become more tightly linked to authorities responsible for the specification components. While the original well-known text (WKT1) descriptions also contained authorities, WKT2-2018 is substantially more stringent. PROJ continues to use the European Petroleum Survey Group (EPSG) database, the local copy PROJ uses is now an SQLite database, with a large number of tables:

library(RSQLite)
db <- dbConnect(SQLite(), dbname="/usr/local/share/proj/proj.db")
cat(strwrap(paste(dbListTables(db), collapse=", ")), sep="\n")
## alias_name, area, authority_list, authority_to_authority_preference,
## axis, celestial_body, compound_crs, concatenated_operation, conversion,
## conversion_method, conversion_param, conversion_table,
## coordinate_operation_method, coordinate_operation_view,
## coordinate_operation_with_conversion_view, coordinate_system, crs_view,
## deprecation, ellipsoid, geodetic_crs, geodetic_datum,
## grid_alternatives, grid_packages, grid_transformation,
## helmert_transformation, helmert_transformation_table, metadata,
## object_view, other_transformation, prime_meridian, projected_crs,
## supersession, unit_of_measure, vertical_crs, vertical_datum
dbDisconnect(db)

Grid CDN mechanism

Current discussions now relate to mechanisms for caching downloaded grids, and advertising their availability to all programs using PROJ, for example GRASS GIS or QGIS.

Up to now, PROJ metadata files have usually been stored in a directory with only read access for users.

New facilities have been added to add to the search path for PROJ metadata files, but downloading often bulky grid files on-the-fly is not seen as a sensible use of resources.

Transformation pipelines

In addition, the current iteration of the standard makes it more important to declare the epoch of interest of coordinates (when the position was recorded and how) and the region of interest.

A transformation pathway may have an undefined epoch and a global span, but cannot achieve optimal precision everywhere.

By bounding the region of interest say within a tectonic plate, and the epoch to a given five-year period, very high precision transformations may be possible.

These choices have not so far been required explicitly, but for example matching against the "area" table in the database may reduce the number of transformation pathways offered dramatically.

CRS status before GDAL3 and PROJ6

The initial use of coordinate reference systems for objects defined ib sp was based on the PROJ.4 string representation, which built on a simplified key=value form.

Keys began with plus (+), and the value format depended on the key.

If essential keys were missing, some might be added by default from a file that has now been eliminated as misleading; if +ellps= was missing and not added internally from other keys, +ellps=WGS84 would be added silently.

Accurate coordinate transformation has always been needed for the integration of data from different sources, but has become much more pressing as web mapping has become available in R, through the leaflet package (Cheng, Karambelkar, and Xie 2018), on which mapview and the "view" mode of tmap.

As web mapping provides zooming and panning, possible infelicities that were too small to detect as mismatches in transformation jump into prominence.

The web mapping workflow transforms input objects to EPSG:4326 (geographical CRS WGS 84, World area of relevance, WGS84 datum) as expected by leaflet, then on to EPSG:3857 (WGS 84 / Pseudo-Mercator) for display on web map backgrounds (this is carried out internally in leaflet.

We’ll be using the Soho cholera data set; I converted the shapefiles from https://asdar-book.org/bundles2ed/die_bundle.zip to GPKG to be more modern (using ogr2ogr in GDAL 3 built against PROJ 6. sf is installed using the proj.h interface in PROJ 6:

buildings <- sf::st_read("snow/buildings.gpkg", quiet=TRUE)
st_crs(buildings)
## Coordinate Reference System:
##   No EPSG code
##   proj4string: "+proj=tmerc +lat_0=49 +lon_0=-2 +k=0.999601 +x_0=400000 +y_0=-100000 +ellps=airy +units=m +no_defs"

To make an interactive display in mapview(), conversion/transformation to “Web Mercator” is needed - this uses a WGS84 datum. But PROJ 6 has dropped the +datum= tag, so the display is not correctly registered.

library(mapview)
mapview(buildings)

The CRS/SRS values in the GPKG file (it is a multi-table SQLite database) include the datum definition:

library(RSQLite)
db = dbConnect(SQLite(), dbname="snow/buildings.gpkg")
dbReadTable(db, "gpkg_spatial_ref_sys")$definition[4]
## [1] "PROJCS[\"Transverse_Mercator\",GEOGCS[\"GCS_OSGB 1936\",DATUM[\"OSGB_1936\",SPHEROID[\"Airy 1830\",6377563.396,299.3249646,AUTHORITY[\"EPSG\",\"7001\"]],AUTHORITY[\"EPSG\",\"6277\"]],PRIMEM[\"Greenwich\",0],UNIT[\"Degree\",0.0174532925199433],AUTHORITY[\"EPSG\",\"4277\"]],PROJECTION[\"Transverse_Mercator\"],PARAMETER[\"latitude_of_origin\",49],PARAMETER[\"central_meridian\",-2],PARAMETER[\"scale_factor\",0.999601],PARAMETER[\"false_easting\",400000],PARAMETER[\"false_northing\",-100000],UNIT[\"metre\",1,AUTHORITY[\"EPSG\",\"9001\"]],AXIS[\"Easting\",EAST],AXIS[\"Northing\",NORTH]]"
dbDisconnect(db)

Maybe using rgdal which is built using PROJ 6 but the legacy proj_api.h interface, and the shapefile as shipped with ASDAR reproduction materials will help?

buildings1 <- rgdal::readOGR("snow/buildings.shp", verbose=FALSE)
## Warning in OGRSpatialRef(dsn, layer, morphFromESRI = morphFromESRI, dumpSRS =
## dumpSRS): Discarded datum OSGB_1936 in CRS definition: +proj=tmerc +lat_0=49
## +lon_0=-2 +k=0.999601 +x_0=400000 +y_0=-100000 +ellps=airy +units=m +no_defs
## Warning in OGRSpatialRef(dsn = dsn, layer = layer, morphFromESRI =
## morphFromESRI, : Discarded datum OSGB_1936 in CRS definition: +proj=tmerc
## +lat_0=49 +lon_0=-2 +k=0.999601 +x_0=400000 +y_0=-100000 +ellps=airy +units=m
## +no_defs
## Warning in showSRID(uprojargs, format = "PROJ", multiline = "NO"): Discarded
## datum Unknown_based_on_Airy_1830_ellipsoid in CRS definition
proj4string(buildings1)
## [1] "+proj=tmerc +lat_0=49 +lon_0=-2 +k=0.999601 +x_0=400000 +y_0=-100000 +ellps=airy +units=m +no_defs"

No, same problem:

mapview(buildings1)

But the shapefile has the datum definition:

readLines("snow/buildings.prj")
## [1] "PROJCS[\"Transverse_Mercator\",GEOGCS[\"GCS_OSGB 1936\",DATUM[\"D_OSGB_1936\",SPHEROID[\"Airy_1830\",6377563.396,299.3249646]],PRIMEM[\"Greenwich\",0],UNIT[\"Degree\",0.017453292519943295]],PROJECTION[\"Transverse_Mercator\"],PARAMETER[\"latitude_of_origin\",49.00000000000002],PARAMETER[\"central_meridian\",-2],PARAMETER[\"scale_factor\",0.999601],PARAMETER[\"false_easting\",400000],PARAMETER[\"false_northing\",-100000],UNIT[\"Meter\",1]]"

There are a number of components to the PROJ/GDAL changes taking place.

One concerns the use of transformation pipelines to represent coordinate operations. These pipelines (there may be many candidates) vary by area of interest, accuracy of transformation coordinates if used, and the availability of grids.

A second component concerns the representation of CRS; if CRS are represented by PROJ strings, and go through the GDAL function OGRSpatialReference exportToProj4(), most +datum= tags will be stripped (see function documentation).

A third component adds area of interest and possibly epoch to the WKT2_2018 version of ISO 19111 as a forward-looking text representation of a CRS.

Proposed developments (using sp and rgdal as prototypes)

The current proposals now exposed in my fork of sp on github (>= 1.3-3) and the development version of rgdal on R-Forge involve the following steps, over and above backward compatibility (no change in handling CRS for PROJ < 6 and GDAL < 3; try to handle wrong case of GDAL < 3 with PROJ 6):

For PROJ >= 6 && GDAL >= 3: supplement the "CRS" PROJ string with a full WKT2_2018 representation. If the object is instantiated by reading a file through GDAL, then exportToProj4() will degrade most CRS when creating a PROJ string. The WKT string is stored as a comment() to the "CRS" object, which is permissable but not desirable in an S4 context, but is backwards compatible. We can see that the WKT string represents the CRS seen in the vector file:

comment(slot(buildings1, "proj4string"))
## [1] "PROJCRS[\"Transverse_Mercator\",BASEGEOGCRS[\"GCS_OSGB 1936\",DATUM[\"OSGB 1936\",ELLIPSOID[\"Airy 1830\",6377563.396,299.3249646,LENGTHUNIT[\"metre\",1]],ID[\"EPSG\",6277]],PRIMEM[\"Greenwich\",0,ANGLEUNIT[\"Degree\",0.0174532925199433]]],CONVERSION[\"unnamed\",METHOD[\"Transverse Mercator\",ID[\"EPSG\",9807]],PARAMETER[\"Latitude of natural origin\",49,ANGLEUNIT[\"Degree\",0.0174532925199433],ID[\"EPSG\",8801]],PARAMETER[\"Longitude of natural origin\",-2,ANGLEUNIT[\"Degree\",0.0174532925199433],ID[\"EPSG\",8802]],PARAMETER[\"Scale factor at natural origin\",0.999601,SCALEUNIT[\"unity\",1],ID[\"EPSG\",8805]],PARAMETER[\"False easting\",400000,LENGTHUNIT[\"metre\",1],ID[\"EPSG\",8806]],PARAMETER[\"False northing\",-100000,LENGTHUNIT[\"metre\",1],ID[\"EPSG\",8807]]],CS[Cartesian,2],AXIS[\"(E)\",east,ORDER[1],LENGTHUNIT[\"metre\",1,ID[\"EPSG\",9001]]],AXIS[\"(N)\",north,ORDER[2],LENGTHUNIT[\"metre\",1,ID[\"EPSG\",9001]]]]"

Using the previous direct instantiation mechanism, we see that the PROJ string is degraded

(o <- CRS("+init=epsg:27700"))
## Warning in showSRID(uprojargs, format = "PROJ", multiline = "NO"): Discarded
## datum OSGB_1936 in CRS definition
## CRS arguments:
##  +proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000
## +y_0=-100000 +ellps=airy +units=m +no_defs

but that the WKT2 payload is safe, and is actually better specified than that from file:

comment(o)
## [1] "PROJCRS[\"OSGB 1936 / British National Grid\",BASEGEOGCRS[\"OSGB 1936\",DATUM[\"OSGB 1936\",ELLIPSOID[\"Airy 1830\",6377563.396,299.3249646,LENGTHUNIT[\"metre\",1]]],PRIMEM[\"Greenwich\",0,ANGLEUNIT[\"degree\",0.0174532925199433]],ID[\"EPSG\",4277]],CONVERSION[\"British National Grid\",METHOD[\"Transverse Mercator\",ID[\"EPSG\",9807]],PARAMETER[\"Latitude of natural origin\",49,ANGLEUNIT[\"degree\",0.0174532925199433],ID[\"EPSG\",8801]],PARAMETER[\"Longitude of natural origin\",-2,ANGLEUNIT[\"degree\",0.0174532925199433],ID[\"EPSG\",8802]],PARAMETER[\"Scale factor at natural origin\",0.9996012717,SCALEUNIT[\"unity\",1],ID[\"EPSG\",8805]],PARAMETER[\"False easting\",400000,LENGTHUNIT[\"metre\",1],ID[\"EPSG\",8806]],PARAMETER[\"False northing\",-100000,LENGTHUNIT[\"metre\",1],ID[\"EPSG\",8807]],ID[\"EPSG\",19916]],CS[Cartesian,2],AXIS[\"(E)\",east,ORDER[1],LENGTHUNIT[\"metre\",1,ID[\"EPSG\",9001]]],AXIS[\"(N)\",north,ORDER[2],LENGTHUNIT[\"metre\",1,ID[\"EPSG\",9001]]],USAGE[SCOPE[\"unknown\"],AREA[\"UK - Britain and UKCS 49°46'N to 61°01'N, 7°33'W to 3°33'E\"],BBOX[49.75,-9.2,61.14,2.88]]]"

The new rgdal::showSRID() function replaces the legacy rgdal::showWKT(), and is used internally in `rgdal::checkCRSArgs_ng(), which gets an additional argument to pass through a CRS string in a different format:

cat(showSRID("+init=epsg:27700", multiline="YES"), "\n")
## PROJCRS["OSGB 1936 / British National Grid",
##     BASEGEOGCRS["OSGB 1936",
##         DATUM["OSGB 1936",
##             ELLIPSOID["Airy 1830",6377563.396,299.3249646,
##                 LENGTHUNIT["metre",1]]],
##         PRIMEM["Greenwich",0,
##             ANGLEUNIT["degree",0.0174532925199433]],
##         ID["EPSG",4277]],
##     CONVERSION["British National Grid",
##         METHOD["Transverse Mercator",
##             ID["EPSG",9807]],
##         PARAMETER["Latitude of natural origin",49,
##             ANGLEUNIT["degree",0.0174532925199433],
##             ID["EPSG",8801]],
##         PARAMETER["Longitude of natural origin",-2,
##             ANGLEUNIT["degree",0.0174532925199433],
##             ID["EPSG",8802]],
##         PARAMETER["Scale factor at natural origin",0.9996012717,
##             SCALEUNIT["unity",1],
##             ID["EPSG",8805]],
##         PARAMETER["False easting",400000,
##             LENGTHUNIT["metre",1],
##             ID["EPSG",8806]],
##         PARAMETER["False northing",-100000,
##             LENGTHUNIT["metre",1],
##             ID["EPSG",8807]],
##         ID["EPSG",19916]],
##     CS[Cartesian,2],
##         AXIS["(E)",east,
##             ORDER[1],
##             LENGTHUNIT["metre",1,
##                 ID["EPSG",9001]]],
##         AXIS["(N)",north,
##             ORDER[2],
##             LENGTHUNIT["metre",1,
##                 ID["EPSG",9001]]],
##     USAGE[
##         SCOPE["unknown"],
##         AREA["UK - Britain and UKCS 49°46'N to 61°01'N, 7°33'W to 3°33'E"],
##         BBOX[49.75,-9.2,61.14,2.88]]]
showSRID("+init=epsg:27700", "PROJ")
## Warning in showSRID("+init=epsg:27700", "PROJ"): Discarded datum OSGB_1936 in
## CRS definition
## [1] "+proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000 +y_0=-100000 +ellps=airy +units=m +no_defs"

While previously rgdal::checkCRSArgs() was called by sp::CRS() and checked or expanded the PROJ string, in the GDAL >= 3 and PROJ >= 6 setting, the new function will use rgdal::showSRID() to provide both a checked PROJ string and a WKT2 string, which are then used to populate the "CRS" object and its comment.

checkCRSArgs_ng
## function (uprojargs = NA_character_, SRS_string = NULL) 
## {
##     no_SRS <- is.null(SRS_string)
##     no_PROJ <- is.na(uprojargs)
##     res <- vector(mode = "list", length = 3L)
##     res[[1]] <- FALSE
##     res[[2]] <- NA_character_
##     res[[3]] <- NA_character_
##     if (!no_SRS) {
##         stopifnot(is.character(SRS_string))
##         stopifnot(length(SRS_string) == 1L)
##     }
##     if (!no_PROJ) {
##         stopifnot(is.character(uprojargs))
##         stopifnot(length(uprojargs) == 1L)
##     }
##     if (!no_SRS) {
##         uprojargs1 <- try(showSRID(SRS_string, format = "PROJ", 
##             multiline = "NO"))
##         if (inherits(uprojargs1, "try-error")) {
##             res[[1]] <- FALSE
##             res[[2]] <- NA_character_
##         }
##         else {
##             res[[1]] <- TRUE
##             res[[2]] <- uprojargs1
##         }
##         wkt2 <- try(showSRID(SRS_string, format = "WKT2", multiline = "NO"))
##         if (!inherits(wkt2, "try-error")) 
##             res[[3]] <- wkt2
##     }
##     else if (!no_PROJ) {
##         uprojargs <- sub("^\\s+", "", uprojargs)
##         uprojargs1 <- try(showSRID(uprojargs, format = "PROJ", 
##             multiline = "NO"))
##         if (inherits(uprojargs1, "try-error")) {
##             res[[1]] <- FALSE
##             res[[2]] <- NA_character_
##         }
##         else {
##             res[[1]] <- TRUE
##             res[[2]] <- uprojargs1
##         }
##         wkt2 <- try(showSRID(uprojargs, format = "WKT2", multiline = "NO"))
##         if (!inherits(wkt2, "try-error")) 
##             res[[3]] <- wkt2
##     }
##     res
## }
## <bytecode: 0xa4f1c38>
## <environment: namespace:rgdal>

Because sp::CRS() is called whenever an object is created, for example when reading from a file, newly instantiated "Spatial" objects should receive updated "CRS" objects with WKT2 comments.

The "Spatial" objects that will not receive updated "CRS" objects are those that have been serialized, as "RDA" or "RDS" objects, for example in package data directories, or for example from (https://gadm.org).

download.file("https://biogeo.ucdavis.edu/data/gadm3.6/Rsp/gadm36_NOR_0_sp.rds", "gadm36_NOR_0_sp.rds")
nor <- readRDS("gadm36_NOR_0_sp.rds")
proj4string(nor)
## [1] "+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0"
comment(slot(nor, "proj4string"))
## NULL

There is no automatic way to update these "CRS" objects, so user intervention will be needed to avoid possible degradation:

(o <- CRS(proj4string(nor)))
## Warning in showSRID(uprojargs, format = "PROJ", multiline = "NO"): Discarded datum WGS_1984 in CRS definition,
##  but +towgs84= values preserved
## CRS arguments:
##  +proj=longlat +ellps=WGS84 +towgs84=0,0,0,0,0,0,0 +no_defs
comment(o)
## [1] "BOUNDCRS[SOURCECRS[GEOGCRS[\"unknown\",DATUM[\"World Geodetic System 1984\",ELLIPSOID[\"WGS 84\",6378137,298.257223563,LENGTHUNIT[\"metre\",1]],ID[\"EPSG\",6326]],PRIMEM[\"Greenwich\",0,ANGLEUNIT[\"degree\",0.0174532925199433],ID[\"EPSG\",8901]],CS[ellipsoidal,2],AXIS[\"longitude\",east,ORDER[1],ANGLEUNIT[\"degree\",0.0174532925199433,ID[\"EPSG\",9122]]],AXIS[\"latitude\",north,ORDER[2],ANGLEUNIT[\"degree\",0.0174532925199433,ID[\"EPSG\",9122]]]]],TARGETCRS[GEOGCRS[\"WGS 84\",DATUM[\"World Geodetic System 1984\",ELLIPSOID[\"WGS 84\",6378137,298.257223563,LENGTHUNIT[\"metre\",1]]],PRIMEM[\"Greenwich\",0,ANGLEUNIT[\"degree\",0.0174532925199433]],CS[ellipsoidal,2],AXIS[\"latitude\",north,ORDER[1],ANGLEUNIT[\"degree\",0.0174532925199433]],AXIS[\"longitude\",east,ORDER[2],ANGLEUNIT[\"degree\",0.0174532925199433]],ID[\"EPSG\",4326]]],ABRIDGEDTRANSFORMATION[\"Transformation from unknown to WGS84\",METHOD[\"Geocentric translations (geog2D domain)\",ID[\"EPSG\",9603]],PARAMETER[\"X-axis translation\",0,ID[\"EPSG\",8605]],PARAMETER[\"Y-axis translation\",0,ID[\"EPSG\",8606]],PARAMETER[\"Z-axis translation\",0,ID[\"EPSG\",8607]]]]"

The new function rgdal::list_coordOps() can be used to explore what alternatives are returned on searching the proj.db database and the grids present on the running platform. If all we are using is the degraded PROJ string, we only find one ballpark alternative, leading to the mis-placement of buildings in Soho (mapview now converts all "Spatial" objects to their sf equivalents, so buildings1 is also mis-placed):

(c0 <- list_coordOps(paste0(proj4string(buildings1), " +type=crs"), "EPSG:4326"))
## Candidate coordinate operations found:  1 
## Strict containment:  FALSE 
## Visualization order:  TRUE 
## Source: +proj=tmerc +lat_0=49 +lon_0=-2 +k=0.999601 +x_0=400000
##         +y_0=-100000 +ellps=airy +units=m +no_defs +type=crs
## Target: EPSG:4326 
## Best instantiable operation has only ballpark accuracy 
## Description: Inverse of unknown + Ballpark geographic offset from unknown to
##              WGS 84 + axis order change (2D)
## Definition:  +proj=pipeline +step +inv +proj=tmerc +lat_0=49 +lon_0=-2
##              +k=0.999601 +x_0=400000 +y_0=-100000 +ellps=airy
##              +step +proj=unitconvert +xy_in=rad +xy_out=deg

If instead we use the WKT comment, we get 7 alternatives, with the best available having 2m accuracy. A 1m accuracy alternative could be available if a grid (URL given) is downloaded and installed (this will follow later).

(c1 <- list_coordOps(comment(slot(buildings1, "proj4string")), "EPSG:4326"))
## Candidate coordinate operations found:  7 
## Strict containment:  FALSE 
## Visualization order:  TRUE 
## Source: PROJCRS["Transverse_Mercator", BASEGEOGCRS["GCS_OSGB 1936",
##         DATUM["OSGB 1936", ELLIPSOID["Airy 1830", 6377563.396,
##         299.3249646, LENGTHUNIT["metre", 1]], ID["EPSG",
##         6277]], PRIMEM["Greenwich", 0, ANGLEUNIT["Degree",
##         0.0174532925199433]]], CONVERSION["unnamed",
##         METHOD["Transverse Mercator", ID["EPSG", 9807]],
##         PARAMETER["Latitude of natural origin", 49,
##         ANGLEUNIT["Degree", 0.0174532925199433], ID["EPSG",
##         8801]], PARAMETER["Longitude of natural origin", -2,
##         ANGLEUNIT["Degree", 0.0174532925199433], ID["EPSG",
##         8802]], PARAMETER["Scale factor at natural origin",
##         0.999601, SCALEUNIT["unity", 1], ID["EPSG", 8805]],
##         PARAMETER["False easting", 400000, LENGTHUNIT["metre",
##         1], ID["EPSG", 8806]], PARAMETER["False northing",
##         -100000, LENGTHUNIT["metre", 1], ID["EPSG", 8807]]],
##         CS[Cartesian, 2], AXIS["(E)", east, ORDER[1],
##         LENGTHUNIT["metre", 1, ID["EPSG", 9001]]], AXIS["(N)",
##         north, ORDER[2], LENGTHUNIT["metre", 1, ID["EPSG",
##         9001]]]]
## Target: EPSG:4326 
## Best instantiable operation has accuracy: 2 m
## Description: Inverse of unnamed + OSGB 1936 to WGS 84 (6) + axis order
##              change (2D)
## Definition:  +proj=pipeline +step +inv +proj=tmerc +lat_0=49 +lon_0=-2
##              +k=0.999601 +x_0=400000 +y_0=-100000 +ellps=airy
##              +step +proj=push +v_3 +step +proj=cart +ellps=airy
##              +step +proj=helmert +x=446.448 +y=-125.157
##              +z=542.06 +rx=0.15 +ry=0.247 +rz=0.842 +s=-20.489
##              +convention=position_vector +step +inv +proj=cart
##              +ellps=WGS84 +step +proj=pop +v_3 +step
##              +proj=unitconvert +xy_in=rad +xy_out=deg
## Operation 6 is lacking 1 grid with accuracy 1 m
## Missing grid: OSTN15_NTv2_OSGBtoETRS.gsb 
## URL: https://download.osgeo.org/proj/proj-datumgrid-europe-latest.zip

rgdal::spTransform() methods have been supplemented to use CRS comments if available, falling back on PROJ strings. The coordinate operation chosen (the best available on the running platform) is searched for and found once, and re-used for all the geometries in the object. The last coordinate operation used is also cached, but it would be up to the users to re-use this pipeline if desired (probably a class for pipeline objects is required):

buildings1_ll <- spTransform(buildings1, CRS("+init=epsg:4326"))
get(".last_coordOp", envir=rgdal:::.RGDAL_CACHE)
## [1] "+proj=pipeline +step +inv +proj=tmerc +lat_0=49 +lon_0=-2 +k=0.999601 +x_0=400000 +y_0=-100000 +ellps=airy +step +proj=push +v_3 +step +proj=cart +ellps=airy +step +proj=helmert +x=446.448 +y=-125.157 +z=542.06 +rx=0.15 +ry=0.247 +rz=0.842 +s=-20.489 +convention=position_vector +step +inv +proj=cart +ellps=WGS84 +step +proj=pop +v_3 +step +proj=unitconvert +xy_in=rad +xy_out=deg"

In these cases we have not so far been further confused by axis swapping - we do not yet know how this may affect workflows.

Having transformed the building outlines using the CRS WKT comment, we have retrieved 2m accuracy:

mapview(buildings1_ll)

Let’s try the Broad Street pump itself: does the proposed solution deliver a work-around (and arguably a robust solution going forward)?

bp <- readOGR("snow/b_pump.gpkg")
## Warning in OGRSpatialRef(dsn, layer, morphFromESRI = morphFromESRI, dumpSRS =
## dumpSRS): Discarded datum OSGB_1936 in CRS definition: +proj=tmerc +lat_0=49
## +lon_0=-2 +k=0.999601 +x_0=400000 +y_0=-100000 +ellps=airy +units=m +no_defs
## OGR data source with driver: GPKG 
## Source: "/home/rsb/presentations/uw19/snow/b_pump.gpkg", layer: "b_pump"
## with 1 features
## It has 1 fields
## Integer64 fields read as strings:  cat
## Warning in OGRSpatialRef(dsn = dsn, layer = layer, morphFromESRI =
## morphFromESRI, : Discarded datum OSGB_1936 in CRS definition: +proj=tmerc
## +lat_0=49 +lon_0=-2 +k=0.999601 +x_0=400000 +y_0=-100000 +ellps=airy +units=m
## +no_defs
## Warning in showSRID(uprojargs, format = "PROJ", multiline = "NO"): Discarded
## datum Unknown_based_on_Airy_1830_ellipsoid in CRS definition
comment(slot(bp, "proj4string"))
## [1] "PROJCRS[\"Transverse_Mercator\",BASEGEOGCRS[\"GCS_OSGB 1936\",DATUM[\"OSGB 1936\",ELLIPSOID[\"Airy 1830\",6377563.396,299.3249646,LENGTHUNIT[\"metre\",1]]],PRIMEM[\"Greenwich\",0,ANGLEUNIT[\"Degree\",0.0174532925199433]],ID[\"EPSG\",4277]],CONVERSION[\"unnamed\",METHOD[\"Transverse Mercator\",ID[\"EPSG\",9807]],PARAMETER[\"Latitude of natural origin\",49,ANGLEUNIT[\"degree\",0.0174532925199433],ID[\"EPSG\",8801]],PARAMETER[\"Longitude of natural origin\",-2,ANGLEUNIT[\"degree\",0.0174532925199433],ID[\"EPSG\",8802]],PARAMETER[\"Scale factor at natural origin\",0.999601,SCALEUNIT[\"unity\",1],ID[\"EPSG\",8805]],PARAMETER[\"False easting\",400000,LENGTHUNIT[\"metre\",1],ID[\"EPSG\",8806]],PARAMETER[\"False northing\",-100000,LENGTHUNIT[\"metre\",1],ID[\"EPSG\",8807]]],CS[Cartesian,2],AXIS[\"easting\",east,ORDER[1],LENGTHUNIT[\"metre\",1,ID[\"EPSG\",9001]]],AXIS[\"northing\",north,ORDER[2],LENGTHUNIT[\"metre\",1,ID[\"EPSG\",9001]]]]"

The untreated view (coercion to sf (0.8-0, not wkt2 branch) ignoring CRS comment then ballpark transformation to 4326 before internal conversion to pseudo-Mercator):

mapview(bp)

If we transform using the CRS WKT comment, we retrieve the pre-PROJ6/GDAL3 position, but can sharpen this if we download and use a higher precision grid:

mapview(spTransform(bp, CRS("+init=epsg:4326")))

Last week we saw the first case in the wild of axis swapping (some CRS are defined by relevant authorities as latitude, longitude).

A further problem is that packages like raster and mapview use verbatim PROJ strings to check CRS equivalence. This has not been resolved.

Akima, Hiroshi, and Albrecht Gebhardt. 2016. akima: Interpolation of Irregularly and Regularly Spaced Data. https://CRAN.R-project.org/package=akima.

Baddeley, Adrian, Ege Rubak, and Rolf Turner. 2015. Spatial Point Patterns: Methodology and Applications with R. London: Chapman; Hall/CRC Press. http://www.crcpress.com/Spatial-Point-Patterns-Methodology-and-Applications-with-R/Baddeley-Rubak-Turner/9781482210200/.

Baddeley, Adrian, and Rolf Turner. 2005. “spatstat: An R Package for Analyzing Spatial Point Patterns.” Journal of Statistical Software 12 (6): 1–42. http://www.jstatsoft.org/v12/i06/.

Baddeley, Adrian, Rolf Turner, and Ege Rubak. 2019. Spatstat: Spatial Point Pattern Analysis, Model-Fitting, Simulation, Tests. https://CRAN.R-project.org/package=spatstat.

Bivand, R. 2002. “Spatial Econometrics Functions in R: Classes and Methods.” Journal of Geographical Systems 4: 405–21.

———. 2006. “Implementing Spatial Data Analysis Software Tools in R.” Geographical Analysis 38: 23–40.

Bivand, Roger. 2000. “Using the R Statistical Data Analysis Language on GRASS 5.0 GIS Database Files.” Computers & Geosciences 26 (9): 1043–52.

———. 2014. “Geocomputation and Open Source Software: Components and Software Stacks.” In Geocomputation, edited by Robert J. Abrahart and Linda M. See, 329–55. Boca Raton: CRC Press. http://hdl.handle.net/11250/163358.

Bivand, Roger, and Albrecht Gebhardt. 2000. “Implementing Functions for Spatial Statistical Analysis Using the R Language.” Journal of Geographical Systems 2 (3): 307–17. https://doi.org/10.1007/PL00011460.

Bivand, Roger, and Nicholas Lewin-Koh. 2019. Maptools: Tools for Handling Spatial Objects. https://CRAN.R-project.org/package=maptools.

Bivand, Roger, Edzer Pebesma, and Virgilio Gomez-Rubio. 2008. Applied Spatial Data Analysis with R. Springer, NY. https://asdar-book.org/.

———. 2013. Applied Spatial Data Analysis with R, Second Edition. Springer, NY. https://asdar-book.org/.

Cheng, Joe, Bhaskar Karambelkar, and Yihui Xie. 2018. Leaflet: Create Interactive Web Maps with the Javascript ’Leaflet’ Library. https://CRAN.R-project.org/package=leaflet.

de Vries, Andrie. 2014. “Finding Clusters of CRAN Packages Using Igraph.” https://blog.revolutionanalytics.com/2014/12/finding-clusters-of-cran-packages-using-igraph.html.

Evers, Kristian, and Thomas Knudsen. 2017. Transformation Pipelines for Proj.4. https://www.fig.net/resources/proceedings/fig\_proceedings/fig2017/papers/iss6b/ISS6B\_evers\_knudsen\_9156.pdf.

Gómez-Rubio, V., J. Ferrándiz-Ferragud, and A. Lopez-Quílez. 2005. “Detecting Clusters of Disease with R.” Journal of Geographical Systems 7 (2): 189–206.

Gómez-Rubio, Virgilio. 2011. RArcInfo: Functions to Import Data from Arc/Info V7.x Binary Coverages. https://CRAN.R-project.org/package=RArcInfo.

Gómez-Rubio, Virgilio, Juan Ferrándiz-Ferragud, and Antonio López-Quílez. 2015. DCluster: Functions for the Detection of Spatial Clusters of Diseases. https://CRAN.R-project.org/package=DCluster.

Gómez-Rubio, V., and A. López-Quílez. 2005. “RArcInfo: Using Gis Data with R.” Computers & Geosciences 31 (8): 1000–1006.

Herring, John R. 2011. “OpenGIS Implementation Standard for Geographic Information-Simple Feature Access-Part 1: Common Architecture.” Open Geospatial Consortium Inc, 111. http://portal.opengeospatial.org/files/?artifact\_id=25355.

ISO. 2004a. ISO 19111:2019 Geographic Information – Referencing by Coordinates. https://www.iso.org/standard/74039.html.

———. 2004b. ISO 19125-1:2004 Geographic Information – Simple Feature Access – Part 1: Common Architecture. https://www.iso.org/standard/40114.html.

Kaluzny, S.P., S.C. Vega, T.P. Cardoso, and A.A. Shelly. 1998. S+SpatialStats. New York, NY: Springer.

Knudsen, Thomas, and Kristian Evers. 2017. Transformation Pipelines for Proj.4. https://meetingorganizer.copernicus.org/EGU2017/EGU2017-8050.pdf.

Kralidis, Athanasios Tom. 2008. “Geospatial Open Source and Open Standards Convergences.” In Open Source Approaches in Spatial Data Handling, edited by G. B. Hall and M. Leahy, 1–20. Berlin: Springer-Verlag.

Lovelace, Robin, Jakub Nowosad, and Jannes Muenchow. 2019. Geocomputation with R. Boca Raton, FL: Chapman and Hall/CRC. https://geocompr.robinlovelace.net/.

Majure, James J., and Albrecht Gebhardt. 2016. sgeostat: An Object-Oriented Framework for Geostatistical Modeling in S+. https://CRAN.R-project.org/package=sgeostat.

Pebesma, Edzer. 2018. “Simple Features for R: Standardized Support for Spatial Vector Data.” The R Journal 10 (1): 439–46. https://doi.org/10.32614/RJ-2018-009.

Pebesma, Edzer, Roger Bivand, and Paulo Ribeiro. 2015. “Software for Spatial Statistics.” Journal of Statistical Software, Articles 63 (1): 1–8. https://doi.org/10.18637/jss.v063.i01.

Pebesma, E. J., and C. G. Wesseling. 1998. “Gstat, a Program for Geostatistical Modelling, Prediction and Simulation.” Computers and Geosciences 24: 17–31.

Renka, R. J., and Albrecht Gebhardt. 2016. tripack: Triangulation of Irregularly Spaced Data. https://CRAN.R-project.org/package=tripack.

Rowlingson, Barry, and Peter Diggle. 2017. Splancs: Spatial and Space-Time Point Pattern Analysis. https://CRAN.R-project.org/package=splancs.

Rowlingson, B., and P. J. Diggle. 1993. “Splancs: Spatial Point Pattern Analysis Code in S-Plus.” Computers and Geosciences 19: 627–55.

Venables, W. N., and B. D. Ripley. 2002. Modern Applied Statistics with S. Fourth. New York: Springer. http://www.stats.ox.ac.uk/pub/MASS4.

Warmerdam, Frank. 2008. “The Geospatial Data Abstraction Library.” In Open Source Approaches in Spatial Data Handling, edited by G. B. Hall and M. Leahy, 87–104. Berlin: Springer-Verlag.

Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software, Articles 59 (10): 1–23. https://doi.org/10.18637/jss.v059.i10.